63 research outputs found

    Copula Based Hierarchical Bayesian Models

    The main objective of our study is to employ copula methodology to develop Bayesian hierarchical models to study the dependencies exhibited by temporal, spatial and spatio-temporal processes. We develop hierarchical models for both discrete and continuous outcomes. In doing so we expect to address the dearth of copula-based Bayesian hierarchical models for studying hydro-meteorological events and other physical processes yielding discrete responses. First, we present Bayesian methods of analysis for longitudinal binary outcomes using generalized linear mixed models (GLMMs). We allow flexible marginal association among the repeated outcomes from different time points. A unique property of this copula-based GLMM is that if the marginal link function is integrated over the distribution of the random effects, its form remains the same as that of the conditional link function. This property enables us to retain the physical interpretation of the fixed effects under both the conditional and marginal models and to obtain a proper posterior distribution. We illustrate the performance of the posited model using real-life AIDS data and demonstrate its superiority over the traditional Gaussian random effects model. We develop a semiparametric extension of our GLMM and re-analyze the data from the AIDS study. Next, we propose a general class of models to handle non-Gaussian spatial data. The proposed model can handle geostatistical data that exhibit skewness, tail-heaviness, and multimodality. We fix the distribution of the marginal processes and induce dependence via copulas. We illustrate the superior predictive performance of our approach in modeling precipitation data as compared to other kriging variants. Thereafter, we employ mixture kernels as the copula function to accommodate non-stationary data. We demonstrate the adequacy of this non-stationary model by analyzing permeability data. In both cases we perform extensive simulation studies to investigate the performance of the posited models under misspecification. Finally, we take up the important problem of modeling multivariate extreme values with copulas. We describe, in detail, how dependence can be induced in the block maxima approach and the peaks-over-threshold approach by an extreme value copula. We prove the ability of the posited model to handle both strong and weak extremal dependence and derive the conditions for posterior propriety. We analyze the extreme precipitation events in the continental United States for the past 98 years and produce a suite of predictive maps.
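
The idea of fixing the marginal distributions and inducing dependence through a copula can be illustrated with a toy example. The sketch below is not the hierarchical Bayesian model described in the abstract; it assumes a bivariate Gaussian copula and exponential marginals purely for illustration, and the function names and parameters (`rho`, `rate1`, `rate2`) are hypothetical.

```python
import math
import random
from statistics import NormalDist


def gaussian_copula_uniforms(rho, n, seed=0):
    """Draw n pairs (u1, u2) from a bivariate Gaussian copula with correlation rho."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        # Correlated standard normals built from two independent draws.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # The normal CDF maps each coordinate to a uniform margin,
        # leaving only the dependence structure (the copula).
        pairs.append((std_normal.cdf(z1), std_normal.cdf(z2)))
    return pairs


def exponential_quantile(u, rate):
    """Inverse CDF of an exponential distribution (illustrative marginal)."""
    u = min(u, 1.0 - 1e-12)  # guard against log(0) in the extreme tail
    return -math.log1p(-u) / rate


def sample_dependent_exponentials(rho, rate1, rate2, n, seed=0):
    """Fixed exponential marginals, dependence induced by the Gaussian copula."""
    return [
        (exponential_quantile(u1, rate1), exponential_quantile(u2, rate2))
        for u1, u2 in gaussian_copula_uniforms(rho, n, seed)
    ]
```

Changing `rho` alters only the dependence between the two coordinates, while each coordinate keeps its exponential marginal. That separation is what makes copulas convenient for non-Gaussian data.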

    Spatio-temporal models of infectious disease with high rates of asymptomatic transmission

    The surprisingly mercurial Covid-19 pandemic has highlighted the need not only to accelerate research on infectious diseases, but also to study them using novel techniques and perspectives. A major contributor to the difficulty of containing the current pandemic is the highly asymptomatic nature of the disease. In this investigation, we develop a modeling framework to study the spatio-temporal evolution of diseases with high rates of asymptomatic transmission, and we apply this framework to a hypothetical country with mathematically tractable geography; namely, square counties uniformly organized into a rectangle. We first derive a model for the temporal dynamics of susceptible, infected, and recovered populations, which is applied at the county level. Next we use likelihood-based parameter estimation to derive temporally varying disease transmission parameters at the state-wide level. While these two methods give us some spatial structure and show the effects of behavioral and policy changes, they miss the evolution of hot zones that have caused significant difficulties in resource allocation during the current pandemic. It is evident that the distribution of cases will not remain statically tied to the population density, as with many other diseases, but will continuously evolve. We model this as a diffusive process where the diffusivity is spatially varying based on the population distribution, and temporally varying based on the current number of simulated asymptomatic cases. With this final addition coupled to the SIR model with temporally varying transmission parameters, we capture the evolution of hot zones in our hypothetical setup.
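
The abstract gives no equations, so the following is only a minimal sketch of the textbook SIR model with a time-varying transmission rate. It omits the asymptomatic compartment and the spatial diffusion the authors describe. The function name, the forward-Euler integrator, and all parameter values are assumptions made for illustration.

```python
def simulate_sir(beta_of_t, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate a single-region SIR model with forward Euler steps.

    beta_of_t: callable returning the transmission rate at time t (per day)
    gamma:     recovery rate (per day)
    Returns a list of (t, S, I, R) tuples.
    """
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r  # total population, conserved by the dynamics
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for k in range(steps):
        t = k * dt
        new_infections = beta_of_t(t) * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((t + dt, s, i, r))
    return history


# Hypothetical policy change: transmission drops after day 30.
def example_beta(t):
    return 0.4 if t < 30 else 0.15


trajectory = simulate_sir(example_beta, gamma=0.1, s0=9990, i0=10, r0=0, days=120)
```

A step function for `beta_of_t` is the simplest stand-in for the behavioral and policy effects mentioned above. An estimated, smoothly varying rate would plug in the same way.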

    Design of Probabilistic Random Forests with Applications to Anticancer Drug Sensitivity Prediction - 2016

    Random forests consisting of an ensemble of regression trees with equal weights are frequently used for the design of predictive models. In this article, we consider an extension of the methodology by representing the regression trees in the form of probabilistic trees and analyzing the nature of heteroscedasticity. The probabilistic tree representation allows for analytical computation of confidence intervals (CIs), and the tree weight optimization is expected to provide tighter CIs with comparable performance in mean error. We approached the prediction of the ensemble of probabilistic trees from two perspectives: as a mixture distribution and as a weighted sum of correlated random variables. We applied our methodology to the drug sensitivity prediction problem on synthetic data and the Cancer Cell Line Encyclopedia dataset and illustrated that tree weights can be selected to reduce the average length of the CI without an increase in mean error.
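
The mixture-distribution perspective can be made concrete with the standard formulas for the mean and variance of a finite mixture. The sketch below assumes each tree reports a leaf mean and a leaf variance for the query point; that representation and the function name are illustrative assumptions, not the authors' implementation.

```python
import math


def mixture_mean_variance(weights, means, variances):
    """Mean and variance of a weighted mixture of per-tree predictive distributions.

    weights:   non-negative tree weights summing to 1
    means:     per-tree predicted means
    variances: per-tree predicted variances
    """
    if not (len(weights) == len(means) == len(variances)):
        raise ValueError("weights, means and variances must have equal length")
    if any(w < 0 for w in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    mean = sum(w * m for w, m in zip(weights, means))
    # E[Y^2] of the mixture is the weighted sum of each component's E[Y^2].
    second_moment = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
    return mean, second_moment - mean * mean
```

The mixture variance contains both the within-tree variances and the spread of the tree means, so shifting weight toward trees that agree and have low leaf variance shrinks it. This is the intuition behind choosing weights to shorten the CI.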

    S1: Supplementary Information for Article: A copula based approach for design of multivariate random forests for drug sensitivity prediction

    Changes in performance with prior feature selection
    Random forest (RF) is designed to create uncorrelated trees using random subsets of features in each node of each tree. RF by itself is a great tool for feature selection from a high-dimensional set of features. However, we observed that prediction accuracy improves when a prior feature selection approach (RELIEFF) [1] is applied. Table A shows the performance of RF, VMRF and CMRF with and without RELIEFF feature selection in 2 drug sets of GDSC.
    Performance analysis for drug sets consisting of more than two drugs
    We have generated empirical copulas for the bivariate cases as they are able to capture all forms of dependency structures. However, generation of empirical copulas has high computational complexity and requires a significant number of training samples at each node. Thus, for more than two drug responses, we considered parametric copulas, and used the difference between Gaussian copula parameters estimated from root node and split node samples in place of the integral difference between empirical copulas. To test our hypothesis that VMRF and CMRF will perform better than RF, we considered a drug set with 4 different drugs from CCLE with a single common target between them and a drug set with 3 different drugs in GDSC with a common target between them. The CCLE set has 482 cell lines and the GDSC set has 308 cell lines. RELIEFF was used to reduce the feature space prior to random forest application. For simplicity, in this case, we’ve used 30% of the sample cell lines as training data and 70% of them as testing data.

    A Copula Based Approach for Design of Multivariate Random Forests for Drug Sensitivity Prediction

    Modeling sensitivity to drugs based on genetic characterizations is a significant challenge in the area of systems medicine. Ensemble-based approaches such as Random Forests have been shown to perform well in both individual sensitivity prediction studies and team-science-based prediction challenges. However, Random Forests generate a deterministic predictive model for each drug based on the genetic characterization of the cell lines and ignore the relationship between different drug sensitivities during model generation. This application motivates the need for multivariate ensemble learning techniques that can increase prediction accuracy and improve variable importance ranking by incorporating the relationships between different output responses. In this article, we propose a novel cost criterion that captures the dissimilarity in the output response structure between the training data and node samples as the difference between the two empirical copulas. We illustrate that copulas are suitable for capturing the multivariate structure of output responses independent of the marginal distributions, and that the copula-based multivariate random forest framework can provide higher prediction accuracy and improved variable selection. The proposed framework has been validated on the Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE) databases.
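
The abstract does not spell out the cost criterion, so the following is only a hedged sketch of one way to compare two bivariate empirical copulas: evaluate both on a regular grid and average the absolute difference. The grid-based approximation, the function names, and the `grid_size` parameter are assumptions made for illustration; ties in the data are not handled.

```python
def pseudo_observations(values):
    """Map a sample to scaled ranks in (0, 1]; assumes no ties."""
    n = len(values)
    order = sorted(range(n), key=lambda k: values[k])
    ranks = [0] * n
    for position, k in enumerate(order, start=1):
        ranks[k] = position
    return [rank / n for rank in ranks]


def empirical_copula(xs, ys):
    """Return C_n(u, v): the fraction of points whose scaled ranks are <= (u, v)."""
    us, vs = pseudo_observations(xs), pseudo_observations(ys)
    n = len(xs)

    def copula(u, v):
        return sum(1 for a, b in zip(us, vs) if a <= u and b <= v) / n

    return copula


def copula_distance(sample_a, sample_b, grid_size=20):
    """Mean absolute difference between two empirical copulas on a grid.

    Each sample is a pair (xs, ys) of equal-length response lists,
    e.g. the two drug responses at the root node and at a candidate node.
    """
    c_a = empirical_copula(*sample_a)
    c_b = empirical_copula(*sample_b)
    grid = [(k + 1) / grid_size for k in range(grid_size)]
    total = sum(abs(c_a(u, v) - c_b(u, v)) for u in grid for v in grid)
    return total / grid_size ** 2
```

Because the copula is built from ranks, any strictly increasing transformation of a response leaves the distance unchanged. This reflects the independence from the marginal distributions noted in the abstract.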